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Abstract 

Previous analytical studies of on-line Independent Component Analysis (ICA) learn- 
ing rules have focussed on asymptotic stability and efficiency. In practice the tran- 
sient stages of learning will often be more significant in determining the success of 
an algorithm. This is demonstrated here with an analysis of a Hebbian ICA algo- 
rithm which can find a small number of non-Gaussian components given data com- 
posed of a linear mixture of independent source signals. An idealised data model is 
considered in which the sources comprise a number of non-Gaussian and Gaussian 
sources and a solution to the dynamics is obtained in the limit where the number of 
Gaussian sources is infinite. Previous stability results are confirmed by expanding 
around optimal fixed points, where a closed form solution to the learning dynamics 
is obtained. However, stochastic effects are shown to stabilise otherwise unstable 
sub-optimal fixed points. Conditions required to destabilise one such fixed point 
are obtained for the case of a single non-Gaussian component, indicating that the 
initial learning rate t] required to successfully escape is very low (rj = 0{N~'^) 
where N is the data dimension) resulting in very slow learning typically requiring 
0{N'^) iterations. Simulations confirm that this picture holds for a finite system. 
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1 Introduction 



Independent component analysis (ICA) is a statistical modelling technique which 
has attracted a significant amount of research interest in recent years (for a review, 
see Hyvarinen, 1999). In ICA the goal is to find a representation of data in terms 
of a combination of statistically independent variables. This technique has a num- 
ber of useful applications, most notably blind source separation, feature extraction 
and blind deconvolution. A large number of neural learning algorithms have been 
applied to this problem, as detailed in the aforementioned review. 

Theoretical studies of on-line ICA algorithms have mainly focussed on asymp- 
totic stability and efficiency, using the established results of stochastic approxima- 
tion theory. However, in practice the transient stages of learning will often be more 
significant in determining the success of an algorithm. In this paper a Hebbian 
ICA algorithm is analysed and a solution to the learning dynamics is obtained in 
the limit of large data dimension. The analysis highlights the critical importance 
of the transient dynamics and in particular an extremely low initial learning rate is 
found to be essential in order to avoid trapping in a sub-optimal fixed point close 
to the initial conditions of the learning dynamics. 

This work focus es on the bigradient learning algorithm introduced by W ang 
and Karhunen (1996) and studied in the context of ICA by H yvarinen and Oja 
(1998) where it was shown to have nice stability conditions. This algorithm can 
be used to extract a small number of independent components from data of high 
dimension and is closely related to projection pursuit algorithms which detect "in- 
teresting" projections in high-dimensional data. The algorithm can be defined in 
on-line mode or can form the basis of a fixed-point batch algorithm which has been 
found to improve computational efficiency ( [Hyvarinen and Oja, 1997| ). In this work 
the dynamics of the on-line algorithm is studied. This may be the preferred mode 
when the model parameters are non-stationary or when the data set is very large. 
Although the analysis is restricted to a stationary data model, the results are rele- 
vant to the non-stationary case in which learning strategies are often designed to 
increase the learning rate when far from the optimum ( Miiller et al., 1998 ). The 
results obtained here suggest that this strategy can lead to very poor performance. 

In order to gain insight into the learning dynamics an idealised model is consid- 
ered in which data is composed of a small number of non-Gaussian source signals 
linearly mixed in with a large number of Gaussian signals. A solution to the dy- 
namics is obtained in the limiting case where the number of Gaussian signals is 
infinite. In this limit one can use techniques from statistical mechanics similar 
to those which have previously been applied to other on-line learning algorithms, 
including other unsupervised Hebbian learning algorithms such as Sanger's PCA 
algorithm (Biehl and Schlosser, 1998). For the asymptotic dynamics close to an 
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optimal solution the stability conditions due to Hyvarinen and Oja (1998| ) are con- 
firmed and the eigensystem is obtained which determines the asymptotic dynamics 
and optimal learning rate decay. However, the dynamical equations also have sub- 
optimal fixed points which are stable for any 0{N^^) learning rate where N is 
the data dimension. Conditions required to destabilise one such fixed point are 
obtained in the case of a single non-Gaussian source, indicating that learning must 
be very slow initially in order to learn successfully. The analysis requires a care- 
ful treatment of fluctuations which prove to be important even in the limit of large 
input dimension. Finally, simulation results are presented which indicate that this 
phenomenon persists also in finite sized systems. 



2 Data model 

The linear data model is shown in figure In order to apply the Hebbian ICA algo- 
rithm one should first sphere the data, ie. linearly transform the data so that it has 
an identity covariance matrix. This can be achieved by standard transformations in 
a batch setting but for on-line learning an adaptive sphering algorithm, such as the 
one introduced by [Cardoso and Laheld (199^ ), could be used. To simplify matters 
it is assumed here that the data has already been sphered. Without loss of general- 
ity it can also be assumed that the sources each have unit variance. The sources are 
decomposed into M non-Gaussian and N — M Gaussian components respectively. 



p{s) 



M 

liPiiSi) 

i=l 



p{n) 



N 

n 

=M+1 



e 2 



(1) 



To conform with the above model assumptions the mixing matrix A must be uni- 
tary. The mixing matrix is decomposed into two rectangular matrices Ag and A„ 
associated with the non-Gaussian and Gaussian components respectively. 
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The unitary nature of A results in the following constraints. 
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X = AsS + Anil 

Figure 1 : The linear mixing model generating the data is shown above. There are 
M non-Gaussian independent sources s while the remaining N — M sources n are 
uncorrelated Gaussian variables. There are N outputs x formed by multiplying the 
sources by the square non-singular mixing matrix A = [As An] . The outputs are 
linearly projected onto the K-dimensional vector y = W'^x. It is assumed that 
M « iV and K « N. 



3 Algorithm 

The f ollowing on-line Hebbian learning rule was introduced by W ang and Karhunen 



(1996) and analysed in the context of ICA by Hyvarinen and Oja (1998[ ), 



-W^ = 1] ax^(p{y*f + aW\l - (W^fW^) , (5) 
where 4>{y'^)i = 4>{y\) is an odd non-linear function which is applied to every 



component of the K-dimensional vector y = W x. The first term on the right 
is derived by maximising the non-Gaussianity of the projections in each compo- 
nent of the vector y. The second term ensures that the columns of W are close 
to orthogonal so that the projections are uncorrelated and of unit variance. The 
learning rate r/ is a positive scalar parameter which must be chosen with care and 
may depend on time. The parameter a is less critical and setting a = 0.5 seems to 
provide reasonable performance in general. The diagonal matrix a has elements 
fjj G {—1,1} which ensure the stability of the desired fixed point. The elements 
of this matrix can either be chosen adaptively or can be chosen according to d pri- 
ori knowledge about the source statistics. Stability of the optimal fixed point is 
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ensured by the condition ( Hyvarinen and Oja, 1998[ ), 



an = Sign((si0(si) - (t>'{si))) , (6) 

assuming we order indices such that yi — > itsj for i < mm{K, M) asymptotically. 
The angled brackets denote an average over the source distribution. 

A remarkable feature of the above algorithm is that the same non-linearity can 
be used for source signals with very different characteristics. For example, both 
sub-Gaussian and super-Gaussian signals can be separated using either = 
or cl){y) = tanh(y), two common choices of non-linearity. 



4 Dynamics for large input dimension 

Define the following two matrices, 

R = W^As , Q = W^W . (7) 

Using the constraint in equation (||) one can show that, 

y = W'^iAsS + Ann) 

= Rs + z where z Af{0,Q - RR^) . (8) 

Knowledge of the matrices R and Q is therefore sufficient to describe the relation- 
ships between the projections y and the sources s in full. Although the dimension 
of the data is N, the dimension of these matrices is K x M and K x K respec- 
tively. The system can therefore be described by a small number of macroscopic 
quantities in the limit of large N as long as K and M remain manageable. 

In appendix ^ it is shown that in the limit N ^ oo, Q ^ I while R evolves 
deterministically according to the following first order differential equation, 

^ = fia- ({(t>{y)s'^) - U^iy)y^ + ycl>iyf)R) - W{cj>{y)cj>{yf)R (9) 



dr 

with rescaled variables t = t/N and ^ = Nrj. This deterministic equation is only 



valid for i? = 0(1) and a different scaling is considered in section 4.2, in which 
case fluctuations have to be considered even in the limit. The brackets denote ex- 
pectations with respect to the source distribution and z ~ AA(0, / — RR^). The 
bracketed terms therefore only depend on R and statistics of the source distribu- 
tions, so that the above equation forms a closed system. 
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4.1 Optimal asymptotics 



The desired solution is one where as many as possible of the K projections mirror 
one of the M sources. If < M then not all the sources can be learned and 
which ones are learned depends on details of the initial conditions. \f K > M 
then which projections mirror each source also depends on the initial conditions. 
For K > M there will be projections which do not mirror any sources; these will 
be statistically independent of the sources and have a Gaussian distribution with 
identity covariance matrix. 

Consider the case where j/j — > Sj for i = 1 . . . m.m{K, M) asymptotically (all 
other solutions can be obtained by a trivial permutation of indices and/or changes 
in sign). The optimal solution is then given by R*j = 6ij which is a fixed point 
of equation (^ as /i — > 0. Asymptotically the learning rate should be annealed in 
order to approach this fixed point and the usual inverse law annealing can be shown 
to be optimal subject to a good choice of prefactor, 
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The asymptotic solution to equation (^ with the above annealing schedule was 
given by Leen et al. (1998). Let Uij = Rij — R^j be the deviation from the fixed 
point. Expanding equation (^ around the fixed point one obtains, 
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Here, tq is the time at which annealing begins and Xij and Vijki are the eigenvalues 
and eigenvectors of the Jacobian of the learning equation to first order in jj.. These 
are written as matrices and tensors respectively rather than scalars and vectors be- 
cause the system's variables are in a matrix. One can think of pairs of indices (i, j) 
and (A;, /) as each representing a single index in a vectorised system. The explicit 
solution to the eigensystem is given in appendix ^ From the eigenvalues, defined 
in equation (28), it is clear that the fixed point is stable if and only if the condition 
in equation (^) is met. 

There is a critical learning rate, ^q" = — 1/A„,ax> where A,^aj is the largest 
eigenvalue (smallest in magnitude, since the eigenvalues are negative), such that 
if /io < /iff" then the approach to the fixed point will be slower than the optimal 
1/r decay. From the eigenvalues given in (28) we find that A,nax = — Cmm where 
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= aii{si4>{si) - 4>'{si)), so that /i^"' = l/^,„„. As long as fiQ > fJ-o" then the 
terms involving tq in equations and (|l^) will be negligible and the asymptotic 
decay will be independent of the initial conditions and transient dynamics. 

Assuming /xq > /^o" ^rid substituting in the explicit expressions for the eigen- 
system we find the following simple form for the asymptotic dynamics to leading 
order in 1/r, 

4.2 Escape from the initial transient 

Unfortunately, the optimal fixed points described in the previous section are not the 
only stable fixed points of equation @. In some cases the algorithm will converge 
to a sub-optimal solution in which one or more potentially detected signals remain 
unlearned and the corresponding entry in R decays to zero. The stability of these 
points is due to the 0{^'^) term in equation which becomes less significant as 
the learning rate is reduced, in which case the corresponding negative eigenvalue 
of the Jacobian eventually vanishes. Higher order terms then lead to instability 
and escape from this sub-optimal fixed point. One can therefore avoid trapping by 
selecting a sufficiently low learning rate during the initial transient. 

Consider the simplest case where A' = Af = 1 in which case the matrix R 
reduces to a scalar: R = Rn and a = an. Expanding equation around = 0, 

^ = t,^R + o^iR" + 0{R') . (14) 

Here, ^4 is the fourth cumulant of the source distribution and the brackets denote 
averages over z ~ AA(0, 1). Although i? = is a stable fixed point, the range 
of attraction is reduced as /x ^ until eventually instability occurs. The condi- 
tion under which one will successfully escape the fixed point is found by setting 

d\R\/dt > 0, 

o- = SignK4(/) z , — . (15) 

Notice that the condition on a for escaping the initial transient is not generally 
the same as the condition in equation (^) which ensures stability of the asymptotic 
fixed point. For (/)(y) = the conditions are exactly equivalent. However, in other 
cases the conditions may conflict and an adaptive choice of a based on equation (^, 



as suggested by Hyvarinen and Oja (1998), may give poor results. With (f){y) 



tanh(y) the conditions appear to be equivalent in many cases. This condition is 



7 



equally applicable to gradient based batch algorithms since it is due to the 0(/x) 
term above, which is not related to fluctuations. 

If the entries in A and W are initially of similar order then one would expect 
R = 0{N~2). This is the typical case if we consider a random and uncorrelated 
choice for A and the initial entries in W . Larger initial values of R could only 
be obtained with some prior knowledge of the mixing matrix which we will not 
assume. With R = 0{N^^) the initial value of ^ required to escape is 0(A^^^), 
indicating a very slow initial phase in the dynamics (recall that the unsealed learn- 
ing rate rj = fi/N). Larger learning rates will result in trapping in the sub-optimal 
fixed point. However, the above result is not strictly valid unless R = 0(1), since 
this was an assumption used in the derivation of equation (^. When R = 0{N^2) 
then one can no longer assume that fluctuations are negligible as — > oo. Define 
the 0(1) quantities r = i?\/]V and u = -qN"^. The mean and variance of the 
change in r at each learning step can be calculated (to leading order in A^^^), 

E[Ar] ~ (-\{(j?{z))v'^r + \Ki{^"'{z))avr'^)N-'^, (16) 
Var[Ar] ~ {(l?{z))v'^N-^ . (17) 

This expression is derived by a similar adiabatic elimination of Qn as carried 
out for the deterministic case in appendix ^ This requires that a and rj are the 
same order before taking the limit N ^ oo, followed by the limit a ^ oo and 
corresponds to the usual form of "slaving" in Haken's terminology in which the 
eliminated variable only contributes to the change in mean. Other scalings may 
result in slightly different expressions (see, for example, Gardiner, 1985 , section 
6.6) although it is expected that the main conclusions described below will not be 
affected. 

The equation for the mean is similar to equation (p^. However, the variance is 
the same order as the mean in the limit — > oo and fluctuations cannot be ignored 
in this case. The system is better described by a Fokker-Planck equation (see, for 
example, Gardiner, 1985) with a characteristic time-scale of 0{N^). The system 



is locally equivalent to a diffusion in the following quartic potential, 

U{r) = \{ct?{z))u\'' - ^^\n^{cp"'{z))\vr^ , (18) 

with a diffusion coefficient D = {(lP'{z))v'^ which is independent of r. The shape 
of this potential is shown in figure ^ A potential barrier At/ must be overcome 
to escape an unstable state close to = (assuming that the condition on a in 
equation ([T^) is satisfied). 

For large v this system corresponds to an Ornstein-Uhlenbeck process with 
a Gaussian stationary distribution of fixed unit variance. Thus, if one chooses too 
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Figure 2: For r = Ry N = 0{1) the dynamics can be represented as a diffusion in 
a symmetric quartic potential U (r). Tiie escape time from an unstable fixed point 
at r = is mainly determined by the potential barrier At/. 



large v initially the dynamics will become localised close to i? = 0. As is reduced 
the potential barrier confining the dynamics is reduced. The time-scale for escape 
for l arge v is mainly determined by the effective size of the barrier (G ardiner, 
1985), 

T.,.o<exp(— ) =exp(^^ii^±A_j . (19) 

As the learning rate is reduced so the time-scale for escape is also reduced. How- 
ever, the choice of optimal learning rate is non-trivial and cannot be determined by 
considering only the leading order terms in R as above, because although small v 
will result in a quicker escape from the unstable fixed point near R = Q this will 
then lead to a very slow learning transient after escape. 

From the above discussion one can draw two important conclusions. Firstly, 
the initial learning rate should be 0{N^'^) or less initially in order to avoid trapping 
close to the initial conditions. Secondly, the time-scale required to escape the initial 
transient is 0{N^), resulting in an extremely slow initial stage of learning. 



4.3 Other sub-optimal fixed points 

In studies of other on-line learning algorithms, such as Sanger's rule and back- 
propagation, a class of sub-optimal fixed points have been discovered which are 
due to symmetries inherent in the learning machine's structure (Saad and Solla, 
1995a^ Biehl and Schwarze, 1995 ; Biehl and Schlosser, 1998 ). These symmetric 
fixed points are unstable for small learning rates, but the eigenvalues determining 
escape are typically of very small magnitude so that trapping can occur if the initial 
conditions are sufficiently symmetric. In practice this will typically occur only 
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for very large input dimensions {N > 10 ) and will result in learning timescales 
of 0{N'^) for 0{N^^) learning rates. Equation @ does exhibit fixed points of 
this type for particular initial conditions. Consider the case A' = M = 2 as an 
example. If initially Ru ~ R21 and R12 ~ R22 then the dynamics will preserve 
this symmetry until instabilities due to slight initial differences lead to escape from 
an unstable fixed point. This symmetry breaking is necessary for good performance 
since each projection must specialise to a particular source signal. 

As mentioned above, sufficiently small differences in the initial value of the en- 
tries in R will typically only occur for very large A^, much larger than the typical 
values currently used in ICA. A very small learning rate is then required to avoid 
trapping in a fixed point near the initial conditions, as discussed in the previous 
section. This initial trapping is far more serious than the symmetric fixed point 
since it requires a learning rate of 0{N~'^) in order to escape, resulting in a far 
greater loss of efficiency. In practice, symmetric fixed points do not appear to be 
a serious problem and we have not observed any such fixed points in simulations 
of finite systems. This may be due to the highly stochastic nature of the initial 
dynamics, in which fluctuations are large compared to the average dynamical tra- 
jectory. This is in contrast to the picture for back-propagation, for example, where 
fluctuations result in relatively small corrections to the average trajectory (Barber 
et al., 1996). The strong fluctuations observed here may help break any symmetries 
which might otherwise lead to trapping in a symmetric fixed point, although a full 
understanding of this effect requires careful analysis of the multivariate diffusion 
equation describing the dynamics near the initial conditions. 

5 Simulation results 

The theoretical results in the previous section are for the Umiting case where N — > 
00. In practice we should verify that the results are relevant in the case of large but 
finite N. In this section simulation evidence is presented which demonstrates that 
the trapping predicted by the theory occurs in finite systems. 

Figures ||(a)-(c) show results produced by an algorithm learning a single pro- 
jection from 100-dimensional data with a single non-Gaussian (uniformly dis- 
tributed) component {N = 100, M = K = \). The matrices A and W are 
randomly initialised with orthogonal, normalised columns. Similar results are ob- 
tained for other random initialisations. A cubic non-linearity is used and a is set 
to —1, although the adaptive scheme for setting a suggested by Hyvarinen and Oja 
(1998) gives similar results. In each example, dashed lines show the maxima of 
the potential in figure ^ Figure ^(a) shows the learning dynamics from a single 
run with i] = 10~^ {v = 0.1). The dynamics follows a relatively smooth tra- 
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jectory in this case and much of the learning is determined by the cubic term in 
equation (|l^). With this choice of learning rate there is a strong dependence on 
the initial conditions, with larger initial magnitude of R often resulting in signif- 
icantly faster learning. However, recall that a high value for R cannot be chosen 
without prior knowledge of the mixing matrix. Figure |3|(b) shows the learning dy- 
namics with a larger learning rate t] = 10~^ {u = 1) for exactly the same initial 
conditions and sequence of data. In this case the learning trajectory is more obvi- 
ously stochastic and is initially confined within the unstable sub-optimal state with 
i? ~ 0. Eventually the system leaves this unstable state and quickly approaches 
i? ~ 1. In this case the dynamics is not particularly sensitive to the initial mag- 
nitude of R although the escape time can vary significantly due to the inherent 
randomness of the learning process. In figure ||(c) the learning dynamics is shown 
for a larger learning rate t] = 4 x 10~'^ {v = 4). In this case the system remains 
trapped in the sub-optimal state for the entire simulation time. 

The analysis in section A2 is only strictly valid for the case of a single non- 
Gaussian source and a single projection. However, similar trapping occurs in gen- 
eral as demonstrated in figures ^d)-(f). The components of R are plotted for 
an algorithm learning two projections from 100-dimensional data with two non- 
Gaussian (uniformly distributed) components (A^ = 100, M = K = 2). The dif- 
ferent learning regimes identified in the single component case are mirrored closely 
in the case of this two component model. 



6 Conclusion 

An on-line Hebbian ICA algorithm was studied for the case in which data com- 
prises a linear mixture of Gaussian and non-Gaussian sources and a solution to 
the dynamics was obtained for the ideahsed scenario in which the number of non- 
Gaussian sources is finite while the number of Gaussian sources is infinite. The 



analysis confirmed the stability conditions found by Pyvarinen and Oja (1998| ) and 
the eigensystem characterising the asymptotic regime was determined. However, it 
was also shown that there exist sub-optimal fixed points of the learning dynamics 
which are stabilised by stochastic effects under certain conditions. The simplest 
case of a single non-Gaussian component was studied in detail. The analysis re- 
vealed that typically a very low learning rate (r? = 0{N^'^) where is the data 
dimension) is required to escape this sub-optimal fixed point, resulting in a long 
learning time of 0{N^) iterations. Simulations of a finite system support these 
theoretical conclusions. 

The sub-optimal fixed point studied here has some interesting features. In 
the limit — > the dynamics becomes deterministic and fluctuations due to the 
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stochastic nature of on-line learning vanish. In this case the sub-optimal fixed point 
is unstable but the Jacobian is zero at the fixed point (in the 1 -dimensional case) 
indicating that one must go to higher order to describe the dynamics. Standard 
methods for describing the dynamics of on-line algorithms have all been devel- 
oped in the neighborhood of fixed points with negative eigenvalues and are not 
applicable in this case ( [Heskes and Kappen, 1993 ). Furthermore, stability of the 



fixed point is induced by fluctuations. This is contrary to our intuition that fluc- 
tuations may be beneficial, resulting in quicker escape from sub-optimal fixed 
points. In the present case one has precisely the opposite effect: stochasticity sta- 
bilises an otherwise unstable fixed point. In similar studies of on-line PCA (Biehl 



and Schlosser, 1998) and back-propagation algorithms ( |Biehl and Schwarze, 1995| ; 



Saad and SoUa, 1995a ^ sub-optimal fixed points have been found which are also 



stabilised when the learning rate exceeds some critical value. However, the scale 
of critical learning rate stabilising these fixed points is typically 0{N~^), much 
larger than in the present case. Also, the resulting time-scale for learning is 0{N'^) 
with a very small prefactor (in practice an 0{N) term will dominate for realistic 
N). These fixed points reflect saddle points in the mean flow while here we have 
a flat region and escape is through much weaker higher order effects. This type 
of sub-optimal fixed point is more reminiscent of those which have been found 
in studies of small networks, which often have a much more dramatic effect on 
learning efficiency ( [Heskes and Wiegerinck, 19^ ). 



It is presently unclear whether on-line ICA algorithms based on Maximum- 
likelihood and Information-theoretic principles (see, for example, Amari et al., 
1996; Bell and Sejnowski, 1995 ; Cardoso and Laheld, 1996| ) exhibit sub-optimal 



fixed points similar to those studied here. These algorithms estimate a square de- 
mixing matrix and will require a different theoretical treatment than for the projec- 
tion model considered here, since there may be no simple macroscopic description 
of the system for large N. 
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A Derivation of the dynamical equations 

From equation (^ one can calculate the change in R and Q (defined in (^) after a 
single learning step, 

Ai? = r]a(f){y)s^ +a{I -Q)R, 

AQ = r]a{I + a{I-Q))ct){y)y^ + 7]ay(f>iyfiI + aiI-Q)) 

+2a{I - Q)Q + a^{I - QfQ + rf ci){y)x^ xci){y)^ . (20) 

Here, the definition in equation (^ and the constraint in equation (Q) have been 
used to set x'^Ag = s^. One can obtain a set of differential equations in the limit 
N ^ oo using a statistical mechanics formulation which has previously been ap- 



plied to the dynamics of on-line PCA algorithms (Biehl, 1994; Biehl and Schlosser, 



1998) as well as other unsupervised and supervised learning algorithms (see, for 
example, Biehl and Schwarze (1995[); Saad and SoUa ( 1995a and contributions 



in |Saad (1998 )). To obtain differential equations one should scale the parameters 
of the learning algorithm in an appropriate way, in particular rj = n/N. Typi- 
cally one chooses a = 0(1) but in order to obtain an analytical solution it is more 
convenient to choose a = uq/N before taking ^ oo and then take the limit 
ao — > oo. The dynamics do not appear to be sensitive to the exact value of a as 
long as a ^ r/ and it is therefore hoped that the dynamical equations are valid 
for a = 0(1) which is usually the case. The learning rate is taken to be constant 
here but the dynamical equations are also valid when the learning rate is changed 



slowly, as suggested for the annealed learning studied in section X. 1 
As iV ^ oo one finds. 



dR 

dQ 

dr 



/.o-((/)(y)sT) + ao(I-Q)fi, (21) 
ficT{cl){y)y^ + ycl){yf) + fi^{ct>{y)ct>{y f) + 2a^Q{I - Q),(22) 
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where r = t/A^ is a rescaled time parameter. The angled brackets denote averages 
over y as defined in equation (||). In deriving the above equations one should check 
that fluctuations in R and Q vanish in the limit N —>■ oo. This relies on an assump- 
tion that R = 0(1) which may not be appropriate in some cases. For example, 
in section a sub-optimal fixed point is analysed where it is more appropriate to 
consider R = 0{\/\fN) and a more careful treatment of fluctuations is required. 

As ao is increased, Q approaches /. If one sets Q — I = q/ao and make the 
d priori assumption that q = 0(1) then, 

— ^ = fiCT{cf>iy)y^ + y<^(y)T) + ^2(0(y)0(y)T) _ 2g + 0(1/^0) . (23) 
As ao — > 00 one can solve for q, 

g = i (/.^(0(2/)yT + ycf^iyf) + fM^{<p{y)<p{yf)) , (24) 

which is consistent with the d priori assumption. Substituting this result into equa- 
tion ( pT] ) leads to equation (^ in the main text. This is an example of adiabatic 



elimination of fast variables ( [Gardiner, 1985| , section 6.6) and greatly simplifies 
the dynamical equations. 



B Eigensystem of asymptotic Jacobian 

The Jacobian of dR/dr as /i ^ is defined (divided by /i), 



J, 



d 



1 dR 



ijkl 



dR, 



kl 



fj, dr 



fi=0/ 



(25) 



This is a tensor rather than a matrix because the system's variables are in a matrix. 
One can think of pairs of indices (i, j) and {k, I) as each representing a single index 
in a vectorised system. If the dynamics is equivalent to gradient descent on some 
potential function then the above quantity is proportional to the (negative) Hessian 
of this cost function. The Jacobian is not guaranteed to be symmetric in the present 
case, so this will not be possible in general. From equation (pb one obtains. 



with. 



(Tii{si(j){si) - (p'isi)) for i < min(i^,M) , 
otherwise . 



(26) 
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One must solve the following eigenvalue problem, 

^ ^ Jijkl^klnm ~ ^nmMijnm i (27) 
kl 

where \ij and VkHj are the eigenvalues and eigenvectors respectively. A solution 
is required for dX\i< K and j < M in order to get a complete set of eigenvalues, 

\i = ~2.^j , Vklii = 5ik5ii , 

^ij = -^i^i + ^j): Vkiij = dikSji - SjkSil fori<j<K, 

^ij = -^i ! Vkiij = SikSji for j > K , 

^ij = + ^j) > ^kiij = ^AkSji + ^jSjkSii for i> j . (28) 
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Figure 3: lOO-dimensional data (N = 100) is produced from a mixture containing 
a small number of uniformly distributed sources. Figures on the left (a-c) are for a 
single non-Gaussian source and a single projection (M = K = 1) while figures on 
the right (d-f) are for two non-Gaussian sources and two projections (M = K = 
2). Each column shows examples of learning with the same initial conditions and 
data but with different learning rates. From top to bottom: 77 = 10^^ {u = 0.1), 
rj = IQ-^ (1/ = 1) and 7? = 4 X 10"^ {u = 4). 
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